Diacritics restoration based on word n-grams for Slovak texts
Abstract
Despite the modern boom in technology, we are still faced with the fact that people write texts without diacritics. There are two main reasons for this. The first, historical reason stems from the past, when the use of diacritics was troublesome and people would write text without them. The second one is speed - typing without diacritics is usually faster. Text without diacritics is easy for people to understand, but in some types of documents, missing diacritics can cause a problem. This is also an issue for computers that process such text. In this paper, we propose an algorithm based on word n-grams (a contiguous sequence of n words) to restore diacritics in texts written in the Slovak language. We compare and evaluate our results with other algorithms developed...
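The general idea of restoring diacritics from word n-grams can be illustrated with a minimal sketch. This is not the algorithm from the paper, whose details are not given here; it is a simple bigram (n = 2) variant under stated assumptions. The function names (`strip_diacritics`, `train`, `restore`), the `<s>` sentence-start token, and the three-sentence toy corpus are all hypothetical and chosen only for illustration.

```python
import unicodedata
from collections import Counter, defaultdict

# Toy training data WITH diacritics (illustrative only, not from the paper).
TOY_CORPUS = [
    "mám rád kávu",
    "veľký sud vína",
    "súd rozhodol včera",
]


def strip_diacritics(text: str) -> str:
    """Remove combining marks, e.g. 'súd' -> 'sud'."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def train(sentences):
    """Collect candidate forms plus unigram and bigram counts."""
    candidates = defaultdict(set)  # stripped form -> forms seen with diacritics
    unigrams = Counter()
    bigrams = Counter()
    for sentence in sentences:
        prev = "<s>"  # assumed sentence-start marker
        for word in sentence.lower().split():
            candidates[strip_diacritics(word)].add(word)
            unigrams[word] += 1
            bigrams[(prev, word)] += 1
            prev = word
    return candidates, unigrams, bigrams


def restore(text, model):
    """Greedy left-to-right restoration: prefer the candidate with the
    highest bigram count given the previous restored word, then the
    highest unigram count; unknown words are left unchanged."""
    candidates, unigrams, bigrams = model
    prev = "<s>"
    restored = []
    for word in text.lower().split():
        options = candidates.get(strip_diacritics(word), {word})
        best = max(
            sorted(options),  # sorted for deterministic tie-breaking
            key=lambda c: (bigrams[(prev, c)], unigrams[c]),
        )
        restored.append(best)
        prev = best
    return " ".join(restored)
```

With the toy corpus, the stripped form "sud" is ambiguous between "sud" (barrel) and "súd" (court), and the preceding word decides which one is chosen. A realistic system would need a large corpus, smoothing for unseen n-grams, and handling of capitalization and punctuation, none of which this sketch attempts.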
Similar resources
Diacritics Restoration in Romanian Texts
There are several languages that use diacritical characters outside the ASCII charset. For some of the languages, most diacritical characters can be deterministically recovered but in general, this is not the prevailing case. However, the difficulty of the task differs from language to language depending on the functional role of the diacritical characters. For Romanian, automatic restoration o...
Diacritics restoration for Arabic dialect texts
Vocalization, diacritization, or diacritics restoration is one of the major challenges in Arabic natural language processing. The Algiers dialect is also affected by this issue. In this paper, we present an automatic diacritization system for standard and dialect Arabic texts based on a statistical approach. The idea is to use available tools in statistical machine translation to build such a system...
Beyond Word N-Grams
We describe, analyze, and experimentally evaluate a new probabilistic model for word-sequence prediction in natural languages, based on prediction suffix trees (PSTs). By using efficient data structures, we extend the notion of PST to unbounded vocabularies. We also show how to use a Bayesian approach based on recursive priors over all possible PSTs to efficiently maintain tree mixtures. These ...
Automatic Diacritics Restoration for Hungarian
In this paper, we describe a method based on statistical machine translation (SMT) that is able to restore accents in Hungarian texts with high accuracy. Due to the agglutination in Hungarian, there are always plenty of word forms unknown to a system trained on a fixed vocabulary. In order to be able to handle such words, we integrated a morphological analyzer into the system that can suggest a...
Comparing Word Relatedness Measures Based on Google $n$-grams
Estimating word relatedness is essential in natural language processing (NLP), and in many other related areas. Corpus-based word relatedness has its advantages over knowledge-based supervised measures. There are many corpus-based measures in the literature that cannot be compared with each other because they use different corpora. The purpose of this paper is to show how to evaluate different corpu...
Journal
Journal title: Open Computer Science
Year: 2021
ISSN: 2299-1093
DOI: https://doi.org/10.1515/comp-2020-0143